See quick reference at the bottom
See full module reference section for full details

This part will show methods by which to read data into the ReproPhylo Project

3.3.1 Reading data from GenBank or EMBL files

GenBank or EMBL files should be the prefered way to read data from online databases because ReproPhylo can store all the associated metadata and make it available for steps such as tree annotation or even Bayestraits analysis. When we pass a GenBank file, only loci and feature types that match the loci we have passed upon creating the Project will be retained, and the rest will be ignored. This is handy for multi-featured GenBank entries that contain any number of genes on top of the ones we are interested in. In this example, only cox1 CDSs will be read from entries of complete mitochondrial genomes. First we read the pickle file from the section 3.2:


In [1]:
from reprophylo import *
pj = unpickle_pj('outputs/my_project.pkpj', git=False)


DEBUG:Cloud:Log file (/home/amir/.picloud/cloud.log) opened

Now we can add data to the project, by reading a list of one or more GenBank files:


In [2]:
input_filnames = ['data/Tetillidae1.gb', 'data/Tetillidae2.gb']
pj.read_embl_genbank(input_filnames)


/home/amir/Dropbox/python_modules/reprophylo.py:1015: UserWarning: Version control off
  warnings.warn('Version control off')

3.3.2 Reading other sequence file formats

When GenBank or EMBL files are read, the accession numbers are used as sequence ID in ReproPhylo. But when other file formats are used, it is difficult to predict whether a unique sequence ID is available in the sequence header. Therefore, ReproPhylo regards data read from other file formats as 'denovo' and creates denovo sequence IDs. For the same reason, there is no mechanism to prevent you from reading the same file twice, at the moment. All the information found in the original sequence header is retained and made available as metadata. ReproPhylo can handle any format that is compatible with the SeqIO module of Biopython. Reading prealigned sequences is done by a different dedicated method which will be discussed below.

3.3.2.1 Reading files

In this example we read a fasta file with an unpublished sequence. We will specify the data type ('dna') and the file format. This means that DNA and protein files need to be read in two separate actions.


In [3]:
# This list can include one or more file names
denovo_sequence_filenames = ['data/Tetillidae_denovo_sequence.fasta'] 
pj.read_denovo(denovo_sequence_filenames, 'dna', format='fasta')


Out[3]:
1

This is how the 'denovo' record looks like if we ask to print it in GenBank format:


In [4]:
for r in pj.records:
    if r.id == 'denovo0': print r.format('genbank')


LOCUS       denovo0                 2092 bp    DNA              UNK 01-JAN-1980
DEFINITION  NIWA2850 Craniella microsigma cox1
ACCESSION   denovo0
VERSION     denovo0
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     source          1..2092
                     /feature_id="denovo0_source"
                     /original_id="NIWA2850"
                     /original_desc="Craniella microsigma cox1"
ORIGIN
        1 atgataggaa ctggatttag cttgcttatt agattagaac tatccgctcc cggattaatg
       61 ttgggtgacg accatttata caatgttatg gtcacggccc acggtcttat aatggtcttt
      121 ttcttagtta tgccggttat gataggtggg ttcggtaatt gaatggttcc cctttacatc
      181 ggggcaccgg atatggcttt tccaagatta aacaatatta gtttttgagt tttacccccc
      241 tcattaatac tactgctagg ttctgctttt gttgaacaag gggttgggac aggatggacc
      301 ctttatccac cattatcaag tatacaggct cattctgggg gctcagtcga tgcggcaatt
      361 tttagtcttc atttggctgg gatctcttca attttagggg caatgaattt tataactact
      421 atctttaata tgcgggcacc tgggattacc atggatagat tgcctctatt tgtttgatct
      481 attttaataa caacttattt gttattatta gctttgccag tattggctgg tgccataact
      541 atgcttttaa cagatagaaa tttcaataca acgttcttcg atcccgctgg tggtggggac
      601 ccaatattat ttcaacattt attttggttc tttgggcatc cggaagttta tgtactagtt
      661 ctccccgggt ttggaattgt ttctcagatt attccaacat tcgcggctaa aaaacaaata
      721 tttggctatc tagggatggt ctatgctatg gtttctatag gaattttagg ttttatagtt
      781 tgagcttgta gatgggcgtg cgatagagtg atctatcgta gtataacatg actgtatgct
      841 ggaaagccta aaaaaagaaa ttcattaatt actcgtaatg acaggttcga tacagtaaaa
      901 atattaatgg agggtcaatc agcaggcaac ggtatagttt atactggagc ctcagagact
      961 acacgtcatg cccttgagga tgatttatat tgagctattg gtttatttga ggccgaagga
     1021 accttaaaga taagtaaggg tcggatctat attagtgcgt gtcaatcgac tagtaatata
     1081 aaggtccttt accgaataaa gggaatattt tgtttagggg gcgttaaaat aagaaaagac
     1141 ccccgttata gtgattgaaa gttagggagc gatttaaata aaatagtaaa attattagtt
     1201 tattaatgga gactaatcac tagaaagaaa aatatacaat tagtagagtt gataaagttc
     1261 ataaattgta aatattttcc taatttggtt gaatactttg gtttagagta ttgaccccga
     1321 ataatgcttg attaactggg tttgttgagg gagatggaaa cttaaatatt cagataagac
     1381 cacaccagtg gcggcccgaa tttcgattac acaaaaagag agagatgttt agatttaatc
     1441 aatgatattt ttcctgggtc catttgggct tcaggcaatc catcagaaca ctttaaatat
     1501 tcggcggggt caataagaac tcgaagtgac tggataaaat attttactag gtatccattt
     1561 aaggggaata aaaatattca atatgtgcgt tggttgaaat gccataatat tgttattcaa
     1621 ggtctacaca aaaccgagaa ggggttagct caaattaaat caatttggac tcaaggtgaa
     1681 gatatagtcc aatcccctta gtaatagggg ggtataacac gattgagtgt tgtaatttaa
     1741 gcatcacatg tttacagttg gaatggatgc cgactctagg gcatacttta gcgctgcaac
     1801 gatgataatc gccgtaccaa ccggaataaa aatctttagt tggatcgcta cagtagtagg
     1861 gggctcattg agaatagata ctcctatgtt atgggctatg ggatttgttt ttttatttac
     1921 tgtaggagga ttaaccggaa ttgtggtagc aagtaattct ttagatgtgt tgctccacga
     1981 cacatattac gttgttgctc attttcatta tgttctatcc atgggggcta tctttgctat
     2041 ctttggaggg gtttattatt gatttggtaa aattactggt tattgttaca ac
//

The record was assigned the ID 'denovo0', and a 'source' feature was created, including the fasta header as the 'original_id' and 'original_desc' qualifiers. However, it has no feature to indicate what locus it is and it will be ignored down the line. It is now up to us to add such a feature. Note that for large scale data, such as Exonerate results, other methods apply and will be discussed later.

3.3.2.2 Adding features

Here we only have one new sequence and we know its ID - 'denovo0' so it is easy enough to add a feature:


In [5]:
pj.add_feature_to_record('denovo0', 'CDS', qualifiers={'gene': 'cox1'})


Out[5]:
'denovo0_f0'

Feature 'denovo0_f0' was created.

Often we would want to assign gene names to a whole lot of sequences based on one name we recognize in the fasta header. We can create a dictionary that will specify the gene and feature type of each sequence:

feature_lookup = {'NIWA2850': ['CDS','cox1'],
                     # If we had more sequences we would add them here:
                     # 'Seq2': ['18S', 'rRNA'],
                     # ...
                    }

Now we can use this dictionary to create the feature:

for r in pj.records:

    source = r.features[0]
    quals = source.qualifiers

    if 'original_id' in quals and quals['original_id'][0] in feature_lookup:

        original_id = quals['original_id'][0]
        feature_type = feature_lookup[original_id][0]
        gene = feature_lookup[original_id][1]
        pj.add_feature_to_record(r.id, feature_type, qualifiers={'gene': gene})

The add_feature_to_record method allows to limit the feature to just a part of the sequence and to add any number of qualifiers. Look it up in the module reference.

This is how the record looks now, with the new feature added:


In [6]:
for r in pj.records:
    if r.id == 'denovo0': print r.format('genbank')


LOCUS       denovo0                 2092 bp    DNA              UNK 01-JAN-1980
DEFINITION  NIWA2850 Craniella microsigma cox1
ACCESSION   denovo0
VERSION     denovo0
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     source          1..2092
                     /feature_id="denovo0_source"
                     /original_id="NIWA2850"
                     /original_desc="Craniella microsigma cox1"
     CDS             1..2092
                     /feature_id="denovo0_f0"
                     /GC_content="37.3804971319"
                     /gene="cox1"
                     /nuc_degen_prop="0.0"
ORIGIN
        1 atgataggaa ctggatttag cttgcttatt agattagaac tatccgctcc cggattaatg
       61 ttgggtgacg accatttata caatgttatg gtcacggccc acggtcttat aatggtcttt
      121 ttcttagtta tgccggttat gataggtggg ttcggtaatt gaatggttcc cctttacatc
      181 ggggcaccgg atatggcttt tccaagatta aacaatatta gtttttgagt tttacccccc
      241 tcattaatac tactgctagg ttctgctttt gttgaacaag gggttgggac aggatggacc
      301 ctttatccac cattatcaag tatacaggct cattctgggg gctcagtcga tgcggcaatt
      361 tttagtcttc atttggctgg gatctcttca attttagggg caatgaattt tataactact
      421 atctttaata tgcgggcacc tgggattacc atggatagat tgcctctatt tgtttgatct
      481 attttaataa caacttattt gttattatta gctttgccag tattggctgg tgccataact
      541 atgcttttaa cagatagaaa tttcaataca acgttcttcg atcccgctgg tggtggggac
      601 ccaatattat ttcaacattt attttggttc tttgggcatc cggaagttta tgtactagtt
      661 ctccccgggt ttggaattgt ttctcagatt attccaacat tcgcggctaa aaaacaaata
      721 tttggctatc tagggatggt ctatgctatg gtttctatag gaattttagg ttttatagtt
      781 tgagcttgta gatgggcgtg cgatagagtg atctatcgta gtataacatg actgtatgct
      841 ggaaagccta aaaaaagaaa ttcattaatt actcgtaatg acaggttcga tacagtaaaa
      901 atattaatgg agggtcaatc agcaggcaac ggtatagttt atactggagc ctcagagact
      961 acacgtcatg cccttgagga tgatttatat tgagctattg gtttatttga ggccgaagga
     1021 accttaaaga taagtaaggg tcggatctat attagtgcgt gtcaatcgac tagtaatata
     1081 aaggtccttt accgaataaa gggaatattt tgtttagggg gcgttaaaat aagaaaagac
     1141 ccccgttata gtgattgaaa gttagggagc gatttaaata aaatagtaaa attattagtt
     1201 tattaatgga gactaatcac tagaaagaaa aatatacaat tagtagagtt gataaagttc
     1261 ataaattgta aatattttcc taatttggtt gaatactttg gtttagagta ttgaccccga
     1321 ataatgcttg attaactggg tttgttgagg gagatggaaa cttaaatatt cagataagac
     1381 cacaccagtg gcggcccgaa tttcgattac acaaaaagag agagatgttt agatttaatc
     1441 aatgatattt ttcctgggtc catttgggct tcaggcaatc catcagaaca ctttaaatat
     1501 tcggcggggt caataagaac tcgaagtgac tggataaaat attttactag gtatccattt
     1561 aaggggaata aaaatattca atatgtgcgt tggttgaaat gccataatat tgttattcaa
     1621 ggtctacaca aaaccgagaa ggggttagct caaattaaat caatttggac tcaaggtgaa
     1681 gatatagtcc aatcccctta gtaatagggg ggtataacac gattgagtgt tgtaatttaa
     1741 gcatcacatg tttacagttg gaatggatgc cgactctagg gcatacttta gcgctgcaac
     1801 gatgataatc gccgtaccaa ccggaataaa aatctttagt tggatcgcta cagtagtagg
     1861 gggctcattg agaatagata ctcctatgtt atgggctatg ggatttgttt ttttatttac
     1921 tgtaggagga ttaaccggaa ttgtggtagc aagtaattct ttagatgtgt tgctccacga
     1981 cacatattac gttgttgctc attttcatta tgttctatcc atgggggcta tctttgctat
     2041 ctttggaggg gtttattatt gatttggtaa aattactggt tattgttaca ac
//

Through the qualifiers dictionary, we can also attempt to add a translation of the sequence. We can also define a location for the feature, as a subset of the whole sequence :

qualifiers={'gene': 'cox1',
            'transl_table': 4,
            'codon_start': 1,
            'organism': 'Craniella microsigma'}

for record in pj.records:
   if 'denovo' in record.id: # New sequences are assigned with IDs starting

                         # with 'denovo'
   pj.add_feature_to_record(record.id, 'CDS',
                            # The location is specified as a list
                            # of lists. Every sub-list is an exon
                            # and has the start, the end and the strand.
                            # The numbers are "real" positions and not
                            # machine. ie, counting starts from 1.
                            location=[[1,786,1],[1742,2092,1]],
                            qualifiers=qualifiers)

</pre> transl_table is the genetic code to use in order to translate the coding sequence into a protein. The number, 4 in this case, specify the table to use, out of the GenBank genetic code tables.

3.3.3 Reading sequence alignments

ReproPhylo allows to read prealigned sequences in any of the Biopython AlignIO compatible formats, as follows:

pj.read_alignment('Another_locus.nex', 'dna', 'CDS', 'ND5', format='nexus')
This will place the alignment in the Project.alignments attribute (pj.alignments in this case) and the unaligned sequences as records in Project.records. There must be a Locus object in pj.loci, that is compatible with the character (dna) feature type (CDS) and the locus name (ND5) specified in the read_alignmnet command. The records will be assigned ‘denovo’ IDs, and the nexus sequence names will be stored in the ‘original_id’ qualifiers. ‘original_desc’ qualifier remain empty in this case, because nexus files don’t have them.

3.3.4 Reading a Nexus alignmnet with PAUP commands

Many published datasets are available in nexus format with charset commands that describe the data partitions. ReproPhylo can read such a matrix, split the partitions into individual alignments and place them in Project.alignments, and then put each sequence from each partition in Project.records. This facilitates experimentation with the data composition. It is even possible to turn such a nexus file directly into a new Project instance with all the information set up. To do that use the following command:

nexus_filename = 'data/some_supermatrix.nex'

pj = pj_from_nexus_w_charset(nexus_filename,

                             'data',             # path to write intermediate fasta file    

                             'dna',              # Character type ('dna' or 'prot')

                             'CDS',              # Feature type (Any)

                             project = True,     # Will return a Project instance instead of a list
                                                 # of fasta files per partition 
                                                 # if project will save it to this file:
                             pickle = 'new_pickle_name',

                             git = True)         # Will start and manage repository

In [8]:
# Update the pickle file
pickle_pj(pj, 'outputs/my_project.pkpj')


Out[8]:
'outputs/my_project.pkpj'

3.3.5 Quick reference


In [ ]:
# Read GenBank or embl files
input_filnames = ['file1', 'file2']
pj.read_embl_genbank(input_filnames)

# Read other formats
denovo_sequence_filenames = ['file1.fasta', 'file2.fasta'] 
pj.read_denovo(denovo_sequence_filenames, 'dna', format='fasta')
#Or
pj.read_denovo(denovo_sequence_filenames, 'prot', format='fasta')

# Add asequence feature to a record
pj.add_feature_to_record('someRecordID', 'CDS', qualifiers={'gene': 'cox1'})
# Or
qualifiers={'gene': 'cox1',
            'transl_table': 4,
            'codon_start': 1,
            'organism': 'Craniella microsigma'}

pj.add_feature_to_record(someRecordID, 'CDS',
                         location=[[1,786,1],[1742,2092,1]],
                         qualifiers=qualifiers)

# Read a sequence alignment
pj.read_alignment('Another_locus.nex', 'dna', 'CDS', 'ND5', format='nexus')

# Read a Nexus alignment with a super-matrix and charset commands
nexus_filename = 'data/some_supermatrix.nex'
pj = pj_from_nexus_w_charset(nexus_filename,
                             'data',
                             'dna',
                             'CDS',
                             project = True,
                             pickle = 'new_pickle_name',
                             git = True)